Abstract



Several recent results in machine learning have established formal connections between
autoencoders (artificial neural network models that attempt to reproduce their inputs) and
other coding models like sparse coding and K-means. This paper explores in depth an autoencoder
model that is constructed using rectified linear activations on its hidden units. Our analysis
builds on recent results to further unify the world of sparse linear coding models. We provide
an intuitive interpretation of the behavior of these coding models and demonstrate this
intuition using small, artificial datasets with known distributions.



1 Introduction 



Large quantities of natural data are now commonplace in the digital world — images, videos, and 
sounds are all relatively cheap and easy to obtain. In comparison, labels for these datasets (e.g., 
answers to questions like, "Is this an image of a cow or a horse?") remain relatively expensive 
and difficult to assemble. In addition, even when labeled data are available for a given task, there 
are often only a few bits of information in the labels, while the unlabeled data can easily contain 
millions of bits. Finally, most collections of data from the natural world also seem to be distributed 
non-uniformly in the space of all possible inputs, suggesting that there is some underlying structure 
inherent in a particular dataset, independent of labels or task. Furthermore, many unsupervised 
learning approaches assume that natural data even lie on a relatively continuous low-dimensional 
manifold of the available space, which provides further structure that can be captured in a model. 
For these reasons, much recent work in machine learning has focused on unsupervised modeling of 
large, easy to obtain datasets. 

An unsupervised model attempts to create a compact representation of the structure of a dataset. 
Some models are designed to learn a set of "basis functions" that can be combined linearly to create 
the observed data. Once such a set has been learned, this representation is often useful in other tasks 
like classification or recognition (but see [4] for interesting analysis). In addition to learning useful
representations, many models that have resulted from this line of work have also been shown to have
interesting theoretical ties with neuroscience and information theory (e.g., [12]).

In this paper, we first synthesize recent results from unsupervised machine learning, with a particular 
focus on neural networks and sparse coding. We then explore in detail the coding behavior of one 
important model in this class: autoencoders constructed with rectified linear hidden units. After 
describing the intuitive behavior of these models, we examine their behavior on artificial and natural 
datasets. 
2 Background



One way to take advantage of a readily available unlabeled dataset O is to learn features that encode 
the data well. Such features should then presumably be useful for describing the data effectively, 
regardless of the particular task. Formally, we wish to find a "dictionary" or "codebook" matrix 
$D \in \mathbb{R}^{n \times k}$ whose columns can be combined or "decoded" linearly to represent elements from the
dataset. There are many ways to do this: for example, one could use random vectors or samples 
from O, or define some cost function over the space of dictionaries and optimize to find a good one. 

Once D is known, then for any input x, we would also like to compute the coefficients
$h \in \mathbb{R}^k$ that give the best linear approximation $\hat{x} = Dh$ to x. This problem is often
expressed as an optimization of a squared error loss

$$h = \arg\min_u \|Du - x\|_2^2 + R(u)$$

where R is some regularizer. If R = 0, for example, then this is linear regression; when
$R = \|\cdot\|_1$, the encoding problem is known as sparse coding or lasso regression [14]. Sparse
coding yields state-of-the-art results on many tasks for a wide variety of dictionary learning
methods [2], but it requires a full optimization procedure to compute h, which might require more
or less computation depending on the input. To address this complexity, Gregor and LeCun [6] tried
approximating this coding procedure by training a neural network (which has a bounded complexity)
on a dataset labeled with its sparse codes. For the remainder of the paper, we consider a similar
problem, learning an efficient code for a set of data, but focus on simpler autoencoder neural
network models to avoid solving the complete sparse coding optimization problem.
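To illustrate why the sparse coding problem above needs an iterative solver, the following sketch approximately minimizes $\|Du - x\|_2^2 + \lambda \|u\|_1$ with iterative shrinkage-thresholding (ISTA). ISTA is a standard solver chosen here for brevity and is not necessarily the one used in the cited work. The names `ista`, `objective`, `lam`, and `n_iter`, the random dictionary, and the toy sizes are all assumptions made for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise shrinkage: the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def objective(D, x, u, lam):
    """Squared reconstruction error plus an L1 penalty on the code u."""
    return np.sum((D @ u - x) ** 2) + lam * np.sum(np.abs(u))

def ista(D, x, lam=0.1, n_iter=500):
    """Approximately minimize ||D u - x||_2^2 + lam * ||u||_1 over u."""
    # Step size 1/L, where L = 2 * sigma_max(D)^2 bounds the curvature
    # of the squared error term.
    L = 2.0 * np.linalg.norm(D, 2) ** 2
    u = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ u - x)            # gradient of the squared error
        u = soft_threshold(u - grad / L, lam / L)  # gradient step, then shrinkage
    return u

rng = np.random.default_rng(0)
D = rng.normal(size=(5, 10))      # hypothetical overcomplete dictionary (n = 5, k = 10)
D /= np.linalg.norm(D, axis=0)    # unit-norm columns
x = rng.normal(size=5)            # a single toy input
h = ista(D, x, lam=0.1)
```

Each input requires its own run of the loop, which is the per-input cost that a feed-forward encoder avoids.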

2.1 Autoencoders 

Autoencoders (H \15\ [TOl are a versatile family of neural network models that can be used to learn 
a dictionary from a set of unlabeled data, and to compute an encoding h for an input x. This 
learning takes place simultaneously, since changing the dictionary changes the optimal encodings. 
An autoencoder attempts to reconstruct its input after passing activation forward through one or 
more nonlinear hidden layers. The general loss function for a one-layer autoencoder can be written 
as 

$$\ell(D) = \frac{1}{|O|} \sum_{x \in O} \left\| \varsigma\big(D\,\sigma(Wx + b)\big) - x \right\|_2^2 + R(D, b)$$

where $\varsigma(\cdot)$ and $\sigma(\cdot)$ are activation functions for the output and hidden neurons, respectively,
$b \in \mathbb{R}^k$ is a vector of hidden unit activation biases, and $W \in \mathbb{R}^{k \times n}$ is a matrix of weights
for the hidden units. By setting $\varsigma(z) = z$, this family of models clearly belongs to the class of
coding models, where $h = \sigma(Wx + b)$. This encoding scheme is easy to compute, but it also assumes that the optimal
coefficients can be represented as a linear transform of the input, passed through some nonlinearity.

Commonly, the encoding and decoding weights are "tied" so that $W = D^\top$, which forces the
dictionary to take on values that are useful for both tasks. Seen another way, tied weights provide an
implicit orthonormality constraint on the dictionary [9]. Tying the weights also reduces by half the
number of parameters that the model needs to tune.
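The one-layer autoencoder loss with tied weights can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: the regularizer is omitted, the function name `autoencoder_loss` and the toy data are assumptions, and the orthonormal dictionary is constructed by hand rather than learned.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def autoencoder_loss(X, D, W, b, hidden=relu, output=lambda z: z):
    """Mean over rows x of X of ||output(D hidden(W x + b)) - x||^2 (no regularizer)."""
    H = hidden(X @ W.T + b)      # (m, k) hidden codes, one row per input
    X_hat = output(H @ D.T)      # (m, n) reconstructions
    return np.mean(np.sum((X_hat - X) ** 2, axis=1))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))                   # toy data: m = 50 inputs, n = 4
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # hand-built orthonormal matrix

# Tied weights (W = D^T) with an orthonormal dictionary and a linear hidden
# layer reconstruct every input exactly, up to rounding error.
linear_loss = autoencoder_loss(X, D=Q.T, W=Q, b=np.zeros(4), hidden=lambda z: z)

# With the rectifier, negative pre-activations are zeroed, so the same
# dictionary can no longer reconstruct the inputs exactly.
relu_loss = autoencoder_loss(X, D=Q.T, W=Q, b=np.zeros(4), hidden=relu)
```

The first case shows why an orthonormal dictionary is a natural solution under tied weights; the second shows that a nonlinearity changes what the dictionary must do to keep the loss low.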

2.2 Rectified linear autoencoders

Traditionally, neural networks are constructed with sigmoid activations like $\sigma(z) = (1 + e^{-z})^{-1}$.
The rectified linear activation function $[z]_+ = \max(0, z)$ has been shown to improve training and
performance in multilayer networks by addressing several shortcomings of sigmoid activation
functions [11, 5]. In particular, the rectified linear activation function produces a true 0 (not just a small
value) for negative input and then increases linearly. Unlike the sigmoid, which has a derivative
that vanishes for large input, the derivative of the rectified linear activation is either 0 for negative
inputs or 1 for positive inputs. This "switching" behavior is inspired by the firing rate pattern seen
in some biological neurons, which are inactive for sub-threshold inputs and then increase firing rate
as input magnitude increases. The true 0 output of the rectified linear activation function effectively
deactivates neurons that are not tuned for a specific input, isolating a smaller active model for that
input [5]. This also has the effect of combining exponentially many linear coding schemes into one
codebook [11].
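The contrast between the two activation functions can be checked numerically. The sketch below evaluates each function and its derivative at a few sample points; the sample values and helper names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    return np.maximum(0.0, np.asarray(z, dtype=float))

def relu_grad(z):
    # Derivative is 0 for negative input and 1 for positive input.
    return np.where(np.asarray(z, dtype=float) > 0, 1.0, 0.0)

z = np.array([-10.0, -1.0, 1.0, 10.0])

sig_out, sig_slope = sigmoid(z), sigmoid_grad(z)   # small but nonzero outputs; slopes vanish at both ends
rel_out, rel_slope = relu(z), relu_grad(z)         # exact zeros for negative input; slope is 0 or 1
```

The sigmoid output at z = -10 is small but strictly positive, whereas the rectifier output is exactly zero, which is what allows a unit to drop out of the reconstruction entirely.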
(a) Classifier view.    (b) Encoder view.

Figure 1: Visualizing the half-spaces associated with data point x and feature vector w. In a linear 
classifier (left), weight vectors define hyperplanes that split data points into two groups. In a linear 
encoder (right), a data point x defines a hyperplane which is shifted in the direction opposite that 
data point through a distance b associated with feature w; the hidden unit corresponding to w is 
coded by the distance from w to the shifted hyperplane. 



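To learn a dictionary from data, Glorot et al. proposed the single-layer rectified linear
autoencoder with tied weights by setting $W = D^\top$, $\varsigma(z) = z$ and $\sigma(z) = [z]_+$ in the general autoencoder
loss, giving

$$\ell(D) = \frac{1}{|O|} \sum_{x \in O} \left\| D\,[D^\top x + b]_+ - x \right\|_2^2 + R(D, b).$$
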
Hidden units in rectified linear autoencoders behave similarly to the "triangle K-means" encoding
proposed by Coates et al., which produces a coefficient vector h for input x such that
$h_i = [\mu - \|x - c_i\|_2]_+$, where $\mu = \frac{1}{k} \sum_{j=1}^{k} \|x - c_j\|_2$ is the mean distance to all cluster centroids. Like
a rectified linear model, triangle K-means produces true 0 activations for many (approximately half)
of the clusters, while coding the rest linearly. An equivalent [4] variation is the "soft thresholding"
scheme [2], where $h_i = [d_i^\top x - \lambda]_+$ for some parameter $\lambda$. This coding scheme is equivalent to a
rectified linear autoencoder with a shared bias of $-\lambda$ on all hidden units. Rectified linear autoencoders
generalize all of the models in this "switched linear" class and provide a unified form for their loss
functions.
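The two encodings described above, and the correspondence between soft thresholding and a rectified linear encoder, can be sketched as follows. The centroids, dictionary, and threshold are toy values chosen for illustration, and the sign convention (every bias set to the negative of the threshold) is an assumption that follows from writing the hidden unit as a rectified sum of a weighted input and a bias.

```python
import numpy as np

def triangle_kmeans_encode(x, C):
    """h_i = [mu - ||x - c_i||_2]_+ with mu the mean distance to the rows of C."""
    dist = np.linalg.norm(C - x, axis=1)
    return np.maximum(0.0, dist.mean() - dist)

def soft_threshold_encode(x, D, lam):
    """h_i = [d_i^T x - lam]_+ with d_i the i-th column of D."""
    return np.maximum(0.0, D.T @ x - lam)

def relu_encode(x, W, b):
    """Rectified linear encoder h = [W x + b]_+."""
    return np.maximum(0.0, W @ x + b)

x = np.array([0.1, 0.1])
C = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])   # toy centroids
D = np.array([[1.0, 0.0, 0.6], [0.0, 1.0, 0.8]])                 # toy dictionary (n = 2, k = 3)
lam = 0.05

h_tri = triangle_kmeans_encode(x, C)    # the farthest centroid is coded as exactly 0
h_soft = soft_threshold_encode(x, D, lam)
h_relu = relu_encode(x, D.T, -lam * np.ones(D.shape[1]))   # tied weights, shared bias
```

In the triangle encoding the effective threshold depends on the input through the mean distance, whereas soft thresholding and the rectified linear encoder use thresholds that are fixed per unit.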



2.3 Whitening 

The switched linear family of models uses linear codes for some regions of input space and produces 
zeros in others. These models perform best when applied to whitened datasets. In fact, Le et
al. [9] showed that sparse autoencoders, sparse coding, and ICA all compute the same family of
loss functions, but only if the input data are approximately white. Empirically, Coates et al.
observed that for several different single-layer coding models, switched-linear coding schemes
perform best when combined with pre-whitened data.
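For readers unfamiliar with whitening, the sketch below shows one common variant, ZCA whitening, which rescales the data along the eigenvectors of its covariance so that the result has approximately identity covariance. The text does not say which whitening transform was used, so the choice of ZCA, the function name `zca_whiten`, and the `eps` stabilizer are assumptions.

```python
import numpy as np

def zca_whiten(X, eps=1e-8):
    """Center the rows of X and rescale so the sample covariance is ~ identity."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / Xc.shape[0]
    evals, evecs = np.linalg.eigh(cov)    # symmetric eigendecomposition
    # Rotate into the eigenbasis, rescale each axis, and rotate back (ZCA).
    T = evecs @ np.diag(1.0 / np.sqrt(evals + eps)) @ evecs.T
    return Xc @ T

rng = np.random.default_rng(0)
A = np.array([[3.0, 1.0], [0.0, 0.5]])    # toy mixing matrix: correlated, unequal scales
X = rng.normal(size=(1000, 2)) @ A
Xw = zca_whiten(X)
cov_w = Xw.T @ Xw / Xw.shape[0]           # approximately the 2 x 2 identity
```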

Whitening seems to be important in biological systems as well: although the human eye does not
receive white input, neurons in retina and LGN appear to whiten local patches of natural images
[13, 3] before further processing in the visual cortex.



3 Switched linear codes 

Many researchers (e.g., [7, 11, 3]) have described how the rectified linear activation function
separates its inputs into two natural groups of outputs: those with 0 values, and those with positive
values. We belabor this observation here to make a further point about the behavior of rectified
linear autoencoders below.
Figure 2: Rectified linear feature planes for a 3D gaussian dataset, using a complete dictionary
(k = 3). With an undercomplete or 1x complete dictionary, these planes tend to move to the bounds
of the dataset, forming a new coordinate system. The image on the left is a full 3D view of the space, 
while the image on the right is a projection onto the x-y plane. 



Consider a data point x and each encoding feature (row) $w_j$ of $W$ geometrically as vectors in $\mathbb{R}^n$,
as shown in Figure 1. In the context of a linear classifier, the weight vector w is often described as
defining the normal to a hyperplane that divides the dataset into two subsets. A linear encoder,
on the other hand, is attempting to determine which hidden units describe its input, so in this context
it is the vector x that defines the normal to a hyperplane $H_x$ passing through the origin. Each feature
$w_j$ in the autoencoder corresponds to a bias term $b_j$; this bias can be seen as translating $H_x$ in the
direction away from x through a distance of $b_j$. After this translation, if $w_j$ lies on the opposite
side of the translated hyperplane $H_{x,b_j}$ from x, then the hidden unit $h_j$ will be set to 0 by a rectified
activation function.^1 In other words, rectification selectively deactivates features $w_j$ in the loss for x
whenever feature $w_j$ points sufficiently far away from x.^2 After hidden unit $h_j$ is deactivated, decoding
entry $w_j$ no longer has any effect on the output. We can thus define

$$\psi_x = \{ j : w_j x + b_j > 0 \}$$

as the set of features that have nonzero activations in the hidden layer of the autoencoder for input
x. Then

$$h_{\psi_x} = [h_{\psi_1}, \ldots, h_{\psi_{|\psi_x|}}]$$

is the vector of hidden activations for just the features in $\psi_x$, $D_{\psi_x}$ is the matrix of corresponding
columns from D, and $b_{\psi_x}$ are the corresponding bias terms. This gives a new loss function

$$\ell_\psi(D) = \frac{1}{|O|} \sum_{x \in O} \left\| D_{\psi_x} (D_{\psi_x}^\top x + b_{\psi_x}) - x \right\|_2^2 + R(D, b)$$

that is computed only over the active features $\psi_x$ for each input x.
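The claim that the rectified model reduces to a purely linear model on the active features can be verified directly for a single input. The following sketch uses a random dictionary and random biases, which are assumptions for illustration only; nothing here is trained.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 3, 6
D = rng.normal(size=(n, k))     # hypothetical dictionary; columns are features
b = 0.5 * rng.normal(size=k)    # hypothetical biases
x = rng.normal(size=n)          # a single toy input

# Full tied-weight rectified linear autoencoder: x_hat = D [D^T x + b]_+
h = np.maximum(0.0, D.T @ x + b)
x_hat_full = D @ h

# Active set: indices j whose pre-activation w_j x + b_j is positive,
# where w_j is the j-th row of D^T.
psi = np.flatnonzero(D.T @ x + b > 0)
D_psi, b_psi = D[:, psi], b[psi]

# Restricted to the active set, no rectifier is needed: the model is affine in x.
x_hat_active = D_psi @ (D_psi.T @ x + b_psi)
```

Both reconstructions agree because the inactive columns of D are multiplied by zero in the full model.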

With b = 0 and a sparse regularizer $R = \|h\|_1$, the loss $\ell_\psi$ reduces to that proposed by Le et al.
[9] for ICA with a reconstruction cost. For whitened input, then, the rectified linear autoencoder
behaves not just like a mixture of exponentially many linear models [11], but rather like a mixture
of exponentially many ICA models: one ICA model is selected for each input x, but all features are
selected from, and must coexist in, one large feature set. Importantly, this whitening requirement
only holds for the region of data delimited by the set of active features, potentially simplifying the
whitening transform.
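The reduction can be written out explicitly. Writing $D_{\psi_x}$ for the columns of D that are active for input x, setting the biases to zero and taking the regularizer to be the L1 norm of the hidden activations gives the form below. The weight $\lambda$ on the sparsity term is an assumption added here for illustration, and the expression should be compared against the reconstruction ICA objective of Le et al. rather than read as a restatement of it.

```latex
\ell_\psi(D)\Big|_{b = 0}
  = \frac{1}{|O|} \sum_{x \in O}
      \left\| D_{\psi_x} D_{\psi_x}^\top x - x \right\|_2^2
    + \lambda \sum_{x \in O} \left\| D_{\psi_x}^\top x \right\|_1
```

With zero biases the active code is simply $D_{\psi_x}^\top x$, and every entry of it is positive by construction, so its L1 norm is the sum of its entries.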

Mirrored by the geometric interpretation, from the perspective of a data point, nonzero bias terms 
can be thought of as translating the data to a new origin along the axis of that feature. Seen from 
the perspective of a feature, the bias terms can help move the active region for that feature to an 
area of space where data are available to describe — potentially ignoring other regions of space that 
contain other portions of the data. From either viewpoint, the bias parameters in the model can 

^1 Analytically, $z_j = w_j x + b_j < 0$, so $h_j = \max(0, z_j) = 0$.
^2 Where "sufficiently" is determined by the bias for that feature.
Figure 3: Sigmoid feature planes for a 3D gaussian dataset, using an overcomplete dictionary (k =
6). The weights corresponding to each hidden neuron form the normal vector to each plane, and the 
distance from the origin is given by the bias for that unit. In the above example, three of the encoding 
planes have partitioned the major axis of the data distribution (perhaps to maximize effective coding 
regions), while the remaining planes capture variance along minor axes of the dataset. The image 
on the left is a full 3D view of the space, while the image on the right is a projection onto the x-y 
plane. 




Figure 4: Rectified linear feature planes for a 3D gaussian dataset, using an overcomplete dictionary 
(k = 6). These feature planes group together in pairs, splitting the dataset along one axis. The image
on the left is a full 3D view of the space, while the image on the right is a projection onto the x-y 
plane. 



thus help compensate for non-centered data. Figure 2 shows typical behavior for a 1x complete
dictionary of rectified linear units when trained on an artificial 3D gaussian dataset. In this case, 
each feature plane uses its associated bias to translate away from the origin, together providing a 
sort of "bounding box" for the data. Though the axes of the box are not necessarily orthogonal, the 
features define a new basis that is often aligned with or near the principal axes of the data. 

3.1 Effective coding region 

Visualizing feature hyperplanes is actually instructive for many types of activation functions.
Figure 3, for instance, shows the feature planes for a 2x overcomplete tied-weights autoencoder with
sigmoid hidden units, trained on an artificial 3D gaussian dataset.^3 Several of the features learned
by this model partition the data along the principal axis of variance, while the remaining features
capture some of the variance along minor axes of the dataset. In comparison, a 2x overcomplete
rectified linear autoencoder trained on a 3D gaussian dataset (Figure 4) creates pairs of negated
feature vectors, in effect providing a full-wave rectified linear output along each axis. Together, these



^3 The feature planes in this case are aligned with the logistic activation value 0.5.
Figure 5: Rectified linear responses summed over input space for mixture of gaussian datasets. The
plot on the left was trained with no regularization, while the plots in the middle and on the right
were trained with an L1 sparsity penalty on the hidden activations. Color delineates regions of equal
potential summed over all hidden units, and small red circles indicate the learned feature vectors. 
Data points are shown in black. 



pairs of planes split the data into orthants, which produces a coding scheme that is 50% sparse
because only half of the features are active for any given region of the input space. 
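The effect of paired, negated features can be reproduced with a hand-built dictionary. This is an idealized sketch, not a trained model: the orthonormal axes, zero biases, and toy data are assumptions chosen so that the pairing is exact.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # three orthonormal axes (columns)
D = np.hstack([Q, -Q])                         # 2x overcomplete: each axis and its negation

X = rng.normal(size=(200, 3))                  # toy 3D gaussian data, one input per row
H = np.maximum(0.0, X @ D)                     # tied weights, zero bias
X_hat = H @ D.T                                # decode with the same dictionary

# In each negated pair exactly one unit is active, so half of all codes are nonzero.
active_fraction = (H > 0).mean()
```

Each pair contributes the positive part and the negative part of the projection onto one axis, and decoding subtracts the second from the first, recovering the signed projection. This is the full-wave rectification described above, and here it yields an exact reconstruction.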

Clearly these two activation functions have different ways of modeling the input data, but why do 
these particular patterns of organization appear? In our view, the unbounded linear output of the 
rectified linear autoencoder units addresses a subtle but important issue inherent in coding with 
sigmoid activation functions: scale. Consider, for example, a two-dimensional dataset consisting of 
points along a one-dimensional manifold 

$$\mathcal{L} = \{ [x, \epsilon]^\top : 0 < x < S \}$$

where $\epsilon$ is gaussian noise. Regardless of the activation function, this network requires just one
hidden unit to represent points in $\mathcal{L}$: we can set the bias to 0 and weights to $[1, 0]^\top$ to align the
feature of the hidden unit with the linear function that describes the data. Then points in the dataset 
are simply coded by their distance from the origin, as given by the output of the activation function. 
For a sigmoid activation function, however, as S increases, the activation function saturates, limiting 
the power of the hidden units to discriminate between small changes in x near S. This saturation 
limits the scale at which sigmoid units can describe data effectively, but it also prevents numerical 
instability by limiting the range of the output. 
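The saturation effect can be quantified by measuring how much a unit's output changes for a small step in x taken just below the upper bound S. The step size and the values of S in this sketch are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def response_gap(act, S, delta=0.1):
    """Change in activation for a step of size delta ending at x = S."""
    return act(S) - act(S - delta)

scales = (1.0, 5.0, 20.0)
sigmoid_gaps = {S: response_gap(sigmoid, S) for S in scales}   # shrinks rapidly as S grows
relu_gaps = {S: response_gap(relu, S) for S in scales}         # stays equal to delta
```

As S grows, the sigmoid unit can no longer distinguish nearby points at the far end of the manifold, while the rectified linear unit responds to the same step with the same change in output at every scale.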

Regardless of the activation function, then, each feature vector can be seen as normal to a hyperplane 
in input space. The activation function encodes data along the axis of the feature as a function of 
the distance to the hyperplane for that feature. Sigmoid activation functions saturate for points far 
from the hyperplane, whereas linear activations solve this "vanishing signal" problem by coding all 
x values with a linear output response. Rectified linear activations make a further distinction by
coding only one half of the input space. However, the outputs of a linear activation are unbounded, 
so networks with inputs at large scales are prone to numerical instability during training. 

4 Behavior on complex data 

So far, the visualization tools that we have used are based on simple gaussian distributions of
data. Many interesting datasets, however, are emphatically not gaussian — indeed, one of the primary 
reasons that ICA is so effective is that it explicitly searches for a non-gaussian model of the data! 
In this section we present several visualizations for these models using more complex datasets: first 
we examine mixtures of gaussians, and then we use the MNIST digits dataset as a foray into a more 
natural set of data. 

We first tested the behavior of rectified linear autoencoders on a small, 2D mixture of gaussians
dataset (see Figure 5). After training, even massively overcomplete dictionaries tended to display
the feature pairing behavior described above, particularly when combined with a sparse regularizer, 
suggesting that these networks are capable of identifying the number of features required to code 
the data effectively. Unfortunately, while it provides simple visualization, working in 2D does not 
necessarily generalize to higher dimensions, as our intuition in low-dimensional spaces can quickly 
lead us astray in larger spaces. 



Figure 6: Summary plots of individual feature response regions for a rectified linear autoencoder 
trained on a mixture of gaussians dataset, including an L1 sparsity penalty. Colored bands indicate
equipotential regions of response for the feature, and data points are shown in black. Interestingly, 
most of the features are inactive for this dataset, with only three features devoted to coding important 
regions of the data. 
Figure 7: PCA features (eigendigits) computed for MNIST digits.



4.1 MNIST digits 

The MNIST dataset consists of 60,000 28 x 28 grayscale images of the handwritten digits 0 through
9. Figure 7 shows the PCA "eigendigits" computed using this dataset; these digits point in the
directions of highest variance for this dataset overall, and are often interpreted as a coding of the digit
using a Fourier basis. Figure 8 shows the features with the highest bias values computed by a
single-layer rectified linear autoencoder on the same dataset. For visualization purposes, each feature w
on the left of each column is paired with its maximally negative feature $v = \arg\min_u w^\top u$ from
the dictionary. Many of the primary features in the dictionary are negative images of each other,
indicating that the "pairing" of feature planes observed for low-dimensional datasets also occurs
with higher-dimensional data.
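The pairing used for this visualization amounts to finding, for each feature, the dictionary entry with the most negative inner product. A minimal sketch follows, assuming the features are stored as the rows of a matrix; the function name and the toy dictionary are illustrative.

```python
import numpy as np

def most_negative_partner(W):
    """For each row w of W, return the index of the row u that minimizes w . u."""
    G = W @ W.T                   # all pairwise inner products between features
    return np.argmin(G, axis=1)

# Toy dictionary with two negated pairs.
W = np.array([[ 1.0,  0.0],
              [-1.0,  0.0],
              [ 0.0,  1.0],
              [ 0.0, -1.0]])
partners = most_negative_partner(W)
```

For this toy dictionary each feature is matched with its own negation, mirroring the negative-image pairs described above.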

Finally, Figure 9 shows features learned by a two-layer rectified linear autoencoder with untied
weights. This model was trained with just 64 first-layer hidden units, and 1024 second-layer units. 
As observed in the low-dimensional gaussian data, the features from the first layer appear to model 
the principal axes of the data, providing a bounding box of 64 dimensions that encodes the data for 
the next layer. The features from the second layer resemble a more traditional sparse code, consisting 
in this case of small segments of pen strokes. It is important to note that the training dataset was not 
Figure 8: Some of the features computed using a single-layer rectified linear autoencoder with
a 1.5x overcomplete dictionary. Each feature w is displayed on the left, and on the right is its
corresponding maximally negative feature $v = \arg\min_u w^\top u$.




(b) 1024 second-layer features. 



Figure 9: MNIST features computed using a two-layer rectified linear autoencoder. 



whitened beforehand, and no regularization was used to train the network — the autoencoder learned 
these features automatically, using only a squared error reconstruction cost. 



5 Conclusion 

This paper has synthesized many recent developments in encoding and neural networks, with a focus 
on rectified linear activation functions and whitening. It also presented an intuitive interpretation for 
the behavior of these encodings through simple visualizations. Sample features learned from both 
artificial and natural datasets provided examples of learning behavior when these algorithms are 
applied to different types of data. 

There are many more paths to explore in this area of machine learning research. Whitening has 
appeared in many places throughout this paper and seems to be a very important component of linear 
autoencoders and coding systems in general. However, this connection seems poorly understood, 
and would benefit from further exploration, particularly with respect to the idea of whitening local 
regions of the input space. 